TOM CLANCY'S

Rebirth of a loved franchise that is
precious to a lot of hard-core gamers 

Gameplay driven 

Destruction as a core gameplay mechanic 

• Destruction must be consistent between 
platforms 

First Rainbow Six shipping with the engine 

• Lots of legacy code from previous 
prototypes 



SIEGE TECH MISSION 

• Targeting 60FPS: 

• GPU : 14ms average on non combat situations 

• CPU : Max 38ms linear time on CPU (consoles)

• Provide scalable destruction 

• Ship at a higher resolution than 720p on all consoles 

• Commit to 4K on PC at a decent framerate 

• Provide a strong PC version on a console oriented production 

• Fitting on 1 GB of RAM becomes a challenge with current gen 


SIEGE IS A LIVE GAME 


• Graphics features can be continually 
iterated on 

• Test new tech to improve look or comfort 

• Auto-exposure for players 

• Be careful not to break things!


SIEGE FRAME


• Hierarchical view of a GPU frame 

• Average 5ms spent on geometry rendering 

• Heavy use of culling!

• Shadow caching! 

• Average 5ms spent on lighting (SSR included) 

• Checkerboard rendering helps! 

• SSAO & SSR ray trace done in async 


• Average 4ms spent on post processing/other full screen processing 


[Figure: GPU frame timeline across the graphics pipe and the async pipe]




SIEGE FRAME 

• Hierarchical view of a CPU critical path 

• 10 ms avg on the critical path

• All passes and tasks able to fork and join to minimize critical path 

• Shadow caching! 

• Max 4ms linear spent on opaque pass 

• Material based draw call system!


OPAQUE RENDERING


SHADOW RENDERING


• All shadows are cache based 

• Use cached Hi-Z for culling 

• Sunlight shadow done in full resolution 

• Separate pass to relieve lighting resolve VGPR pressure 

• Uses Hi-Z representation of the cached shadow map to reduce the work per pixel 

• Local lights are resolved in a quarter resolution 

• Resolved results stored in a texture array 

• Lower VGPR usage on light accumulation 

• Bilateral upscale 


SHADOW RENDERING - SUN / MOON 

• Shadow map containing all static objects built on load 



SHADOW RENDERING - SUN / MOON

• Ability to scale shadow cost by mixing cascades with static map 

• Static Hi-Z shadow map always used for dynamic object culling 

• On Xbox One : 

• 1st cascade is fully dynamic (not enough resolution with 6K)

• 2nd and 3rd cascades render dynamic objects only and blend with the static shadow map

• 4th cascade is substituted by the static shadow map 


SHADOW RENDERING - LOCAL PROJECTORS 

• We handle a maximum of 8 visible shadowed local lights 




LIGHTING 

• Uses a clustered structure on the frustum: 

• 32x32 pixels based tile 

• Z exponential distribution 

• Hierarchical culling of light volume to fill the structure 

• Local cubemaps regarded as lights 

• Shadows, cubemaps and gobos reside in texture arrays

• Deferred uses pre-resolved shadow texture array

• Forward uses shadows depth buffer array


RAINBOW SIX DESTRUCTION


ART DIRECTION 

• When destruction happens you need to feel that something big went on! 

FLOORS & WALLS 

• Procedurally generated unique geometry

• Poking holes degrades occlusion efficiency 

DESTRUCTIBLE PROPS & DEBRIS 

• Generally smaller meshes but in great numbers 

• Can be instanced or unique


RAINBOW SIX DESTRUCTION


• Early prototypes were largely graphics bound (CPU and GPU) on average

• PC DX11 deferred contexts aren’t that great at scaling

• Material based draw call system 

• Materials define destruction properties 

• Debris share material 

• See [Haar&Aaltonen15]

• In need of granularity in culling to keep up with destruction 


UNIFIED BUFFERS 


• A lot of resources in Rainbow Six reside in a unified buffer of some sort:

• Unified Vertex Buffer 

• Unified Index Buffer 

• Unified Constant Buffer 
• ...

• Structured buffers built on top of raw buffers with auto generated code: 

• Using C++ data descriptors for GPU unified data 

• Meta data passed on to specify access pattern 


UNIFIED BUFFERS - CONSTANT


// Cpp Code
void EntityConstantBufferBridge::CreateDescriptor(ConstantBufferDescriptor* descriptor)
{
    popBeginMapShaderParametersAndSetUsage(EntityConstantBufferBridge, descriptor, Entity);
    popMapShaderParameter("g_NodeWorld", m_WorldMatrix);
    popMapShaderParameter("g_InvScaleSqr", m_InvScaleSqr);
}
// ...

// HLSL Generated Code
ByteAddressBuffer g_UnifiedConstantBuffer;
void LoadEKInstanceProvider(uint index)
{
    uint offset = Mad_U24(index, EK_INSTANCEPROVIDER_STRIDE, EK_INSTANCEPROVIDER_GLOBALBUFFER_OFFSET);
    g_NodeWorld = ToFLOAT4x4(
        g_UnifiedConstantBuffer.Load4(offset + 0x00),
        g_UnifiedConstantBuffer.Load4(offset + 0x10),
        g_UnifiedConstantBuffer.Load4(offset + 0x20),
        g_UnifiedConstantBuffer.Load4(offset + 0x30)
    );
    g_InvScaleSqr = ToFLOAT4(g_UnifiedConstantBuffer.Load4(offset + 0x40));
}
// ...
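The generated loader above boils down to "index times stride plus base offset, then raw 16-byte loads". A minimal CPU-side sketch of that addressing scheme in C++ follows. `RawBuffer`, `kEntityStride`, `kEntityBaseOffset` and `EntityConstants` are hypothetical names and values chosen for illustration; only the layout (a 4x4 matrix at +0x00..+0x30 and a float4 at +0x40) is taken from the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

struct Float4 { float x, y, z, w; };

// CPU stand-in for a ByteAddressBuffer: raw bytes addressed by byte offset.
class RawBuffer {
public:
    explicit RawBuffer(std::size_t sizeBytes) : bytes_(sizeBytes, 0) {}

    void Store4(std::uint32_t offset, const Float4& v) {
        assert(offset + sizeof(Float4) <= bytes_.size());
        std::memcpy(bytes_.data() + offset, &v, sizeof(Float4));
    }

    Float4 Load4(std::uint32_t offset) const {
        assert(offset + sizeof(Float4) <= bytes_.size());
        Float4 v;
        std::memcpy(&v, bytes_.data() + offset, sizeof(Float4));
        return v;
    }

private:
    std::vector<std::uint8_t> bytes_;
};

// Assumed values: 64 bytes of matrix + 16 bytes of InvScaleSqr per entity,
// stored after an arbitrary base offset inside the unified buffer.
constexpr std::uint32_t kEntityStride = 0x50;
constexpr std::uint32_t kEntityBaseOffset = 0x100;

struct EntityConstants {
    Float4 world[4];
    Float4 invScaleSqr;
};

// Same arithmetic as the Mad_U24 call on the slide: index * stride + base.
inline std::uint32_t EntityOffset(std::uint32_t index) {
    return index * kEntityStride + kEntityBaseOffset;
}

inline void StoreEntity(RawBuffer& buf, std::uint32_t index, const EntityConstants& e) {
    const std::uint32_t offset = EntityOffset(index);
    for (std::uint32_t row = 0; row < 4; ++row)
        buf.Store4(offset + row * 0x10, e.world[row]);
    buf.Store4(offset + 0x40, e.invScaleSqr);
}

inline EntityConstants LoadEntity(const RawBuffer& buf, std::uint32_t index) {
    const std::uint32_t offset = EntityOffset(index);
    EntityConstants e;
    for (std::uint32_t row = 0; row < 4; ++row)
        e.world[row] = buf.Load4(offset + row * 0x10);
    e.invScaleSqr = buf.Load4(offset + 0x40);
    return e;
}
```

Because the layout lives in one descriptor, changing the stride or field offsets only touches the generated accessors, which is the migration benefit the next slide lists.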




UNIFIED BUFFERS - BENEFITS


• Complete control over data layout 

• We can easily experiment with different data type accesses (AOS, SOA, 
Structure of u32 Arrays...) 

• Custom packing and support for new data types 

• High level API supports broadcasting values 

• Code auto-generation allows us to migrate to new access patterns easily 


MATERIAL BASED DRAW CALLS 


• Geometry and constants are unified 

• A draw call is then defined by:

• Shaders 

• Non-Unified Resources (Textures, etc...) 

• Render States (Sampler States, Raster States) 

• Elements that share the above are batched together 

• Passes that don’t use a subset of the resources and states are further batched together 
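As a rough illustration of the batching rule above, the following C++ sketch groups submesh instances by a key made of the three things that still distinguish draw calls once geometry and constants are unified. `DrawKey`, `SubmeshInstance` and `BuildBatches` are hypothetical names, and the integer ids stand in for real shader, resource and state objects; this is not the shipped implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <tuple>
#include <vector>

// Everything that cannot be fetched from the unified buffers.
struct DrawKey {
    std::uint32_t shaderId;       // shaders
    std::uint32_t resourceSetId;  // non-unified resources (textures, etc.)
    std::uint32_t renderStateId;  // sampler and raster states

    bool operator<(const DrawKey& other) const {
        return std::tie(shaderId, resourceSetId, renderStateId) <
               std::tie(other.shaderId, other.resourceSetId, other.renderStateId);
    }
};

struct SubmeshInstance {
    std::uint32_t index;  // globally unique submesh instance index
    DrawKey key;
};

using Batches = std::map<DrawKey, std::vector<std::uint32_t>>;

// Instances sharing shaders, resources and render states land in one batch.
inline Batches BuildBatches(const std::vector<SubmeshInstance>& instances) {
    Batches batches;
    for (const SubmeshInstance& instance : instances)
        batches[instance.key].push_back(instance.index);
    return batches;
}
```

A pass that ignores part of the key (for example a shadow pass that does not read material textures) could zero that field before grouping, which merges more instances into the same batch.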


GATHERING DRAW CALLS 


• On initialization, each submesh instance is mapped to 3 batches: Normal, Shadow and Visibility 

• The batch types are used to mask unnecessary data

• Each batch will correspond to a MultiDrawIndexedIndirect command



[Figure: submesh instance X mapped to its batches (Normal Batch 1, Normal Batch 2, Normal Batch 3, Visibility Batch 2)]





GATHERING DRAW CALLS 


• Each submesh instance has a globally unique index: 

• Index used to fetch all data 

• Multiple indirection needed 


[Figure: indirection chain. The submesh instance index resolves to a mesh index, submesh index, entity index and mesh instance index, which in turn give the base vertex offset, base cluster offset, entity matrix and entity inverse scale]







GATHERING DRAW CALLS 


• For each pass gather the submesh instance 
index into a dynamic buffer: 

• Each pass maps to one batch type exclusively 

• Buffer filled in multithreaded jobs (1.5ms linear)

• Extra data to perform culling is added:

• MultiDrawIndexedIndirect entry

• New index buffer offset 

• Additional culling flags 


[Figure: pass buffer layout. Each entry holds a submesh instance index, a draw buffer offset, an index buffer offset and culling flags]



PERFORMING CULLING 


We define multiple types of culling: 

• Level 1: Submesh instance culling 

• Level 2: Submesh chunk culling

• Level 3: Submesh triangle culling


PERFORMING CULLING


[Figure: culling stages per level. Level 1: screen space size culling, distance culling, frustum culling. Level 2: screen space size culling, frustum culling, orientation culling. Level 3: triangle normal culling. Occlusion culling appears twice in the diagram.]
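To make the first level concrete, here is a minimal CPU-side sketch in C++ of per-instance distance and frustum culling that compacts the surviving instance indices, similar in spirit to what the culling compute shader would write out. The bounding-sphere representation, the plane convention and all names (`Sphere`, `Plane`, `CullInstances`) are assumptions for illustration; screen space size and occlusion tests are left out.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Vec3 { float x, y, z; };
struct Plane { Vec3 normal; float d; };      // inside when dot(normal, p) + d >= 0
struct Sphere { Vec3 center; float radius; };

inline float Dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Frustum test: reject when the sphere lies fully behind any plane.
inline bool SphereInFrustum(const Sphere& s, const std::vector<Plane>& planes) {
    for (const Plane& p : planes) {
        if (Dot(p.normal, s.center) + p.d < -s.radius)
            return false;
    }
    return true;
}

// Distance test: keep spheres that reach within maxDistance of the eye.
inline bool WithinDistance(const Sphere& s, const Vec3& eye, float maxDistance) {
    const Vec3 delta{s.center.x - eye.x, s.center.y - eye.y, s.center.z - eye.z};
    const float reach = maxDistance + s.radius;
    return Dot(delta, delta) <= reach * reach;
}

// Returns the indices of the instances that survive both tests, in order.
inline std::vector<std::uint32_t> CullInstances(const std::vector<Sphere>& bounds,
                                                const std::vector<Plane>& planes,
                                                const Vec3& eye,
                                                float maxDistance) {
    std::vector<std::uint32_t> visible;
    for (std::uint32_t i = 0; i < bounds.size(); ++i) {
        if (WithinDistance(bounds[i], eye, maxDistance) &&
            SphereInFrustum(bounds[i], planes))
            visible.push_back(i);
    }
    return visible;
}
```

On the GPU the `push_back` would typically become an atomic append or a prefix-sum compaction into the per-instance buffer; that part is not shown here.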




PERFORMING DRAW CALLS


[Figure: pass buffer entries (submesh instance 1, submesh instance 2, ...), each with a draw buffer offset, an index buffer offset and culling flags]






PERFORMING A DRAW CALL


[Figure: draw call entry parameters (index count (N), instance count (1..N), start index, base vertex (0), start instance) pointing into the vertex data]







PERFORMING A DRAW CALL


• Culling compute shader writes out instance indices in a Per Instance Buffer


struct VertexShaderInput
{
    uint PrimitiveInstanceIdx : PerInstanceInfo;
    uint VertexID : SV_VertexID;
};

void GetVertexFromUnifiedBuffer(uint clusterFirstVertexOffset, uint vertexIdx, out UnpackedVertexFormat vertex)
{
    uint offset = clusterFirstVertexOffset + vertexIdx * VERTEX_FORMAT_STRIDE;
    vertex.Position = ToFLOAT3(g_VertexBuffer.Load3(offset + 0));
    vertex.Normal = ToUBYTE4(g_VertexBuffer.Load(offset + 12));
    vertex.TexCoord0 = ToFLOAT16_2(g_VertexBuffer.Load(offset + 16));
    vertex.VertexID = vertexIdx;
}




PERFORMING A DRAW CALL


struct VertexShaderOutput
{
    nointerpolation uint4 UniformsOffsets;
};

VertexShaderOutput VSMain(VertexShaderInput input)
{
    LoadUniforms(input.PrimitiveInstanceIdx);
    UnpackedVertexFormat vfUnpacked;
    GetVertexFromUnifiedBuffer(g_ClusterFirstVertexOffset, input.VertexID, vfUnpacked);
    VertexShaderOutput vsOut;
    vsOut.UniformsOffsets = g_UniformsOffsets;
    // ...
}

• ReadFirstLane is used in the pixel shader when loading UCB values with UniformsOffsets to be able to use the GCN scalar unit & registers


VISUALISATION


FUTURE WORK


• Pushing empty draw calls has a cost 

• We try to hide it on consoles using async jobs 

• Specifying the number of draw calls on the GPU would be the next step 

• Using bindless resources to further batch draw calls 

• Moving most of the scene graph traversal to the GPU 

• LoD selection logic


CHECKERBOARD RENDERING


60 FPS MADE EASY 


• We wanted 60FPS early in production 

• First playable was running at around 50 on consoles 

• 60 FPS average was hit a couple weeks after!

• Killzone approach seemed like a good idea to start with (see [Valient14])

• Keeping nearly the same budget per pixel as a 30 FPS game for screen pixels rendering 

• EQAA based, we wanted it on PC too (low end and 4K support) 

• Big “quick” win without having a major quality impact

• Silently enabled to see if people noticed 


TEMPORAL INTERLACED RENDERING 


• To target 1920x1080:

• We render geometry and lighting to a 960x1080 render target

• 3D velocity vector per rendered pixel

• R12G12B8 format

• Projection matrix is offset each frame

• Need to divide x gradient by 2 to have similar texture filtering


float4 SampleTexture2D(Texture2D t, SamplerState s, float2 coord)
{
#if !INTERLACED_RENDERING
    return t.Sample(s, coord);
#else
    float2 dx = ddx(coord);
    float2 dy = ddy(coord);
    return t.SampleGrad(s, coord, dx / 2, dy);
#endif
}



TEMPORAL INTERLACED RENDERING 


• Things not represented by motion on screen need to be dealt with 

• Tried to maintain lighting/shadow changes to handle them better 

• Color clamping (See [Karis14])

• Data tweaked so alternating effects take place over at least two frames 

• Police car flash lights, light flickering 

• Flickering oscillators modified to avoid single frame 0 to 1 transitions 

• Aliasing on vertical lines

• Not that easy after all! 


CHECKERBOARD RENDERING 


• Base idea came about to solve aliasing issues 

• Experimented on a series of images to first test quality 

• For most images PSNR was better using a checkerboard pattern: 

• Visually the results were more pleasing too 

• The idea of using MSAA 2X was bouncing around since the beginning 

• We made a push for it for E3 2015


[Figures: three image comparisons of line neighbors interpolation against checkerboard neighbors interpolation]



CHECKERBOARD RENDERING IMPLEMENTATION 

• Rendering to a 1/4 size (1/2 width by 1/2 height) resolution with MSAA 2X:

• We end up with half the samples of the full resolution image 

• D3D MSAA 2X standard pattern 

• 2 Color and Z samples 

• Sample modifier or SV_SampleIndex input to enforce
rendering all samples

• Each sample falls on the exact pixel center of full screen 
render target 
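The claim that each sample lands on a full resolution pixel center can be checked with a little arithmetic. The C++ sketch below maps a quarter-size render target pixel and an MSAA 2X sample index to the full resolution pixel it covers, using the D3D standard 2X positions (sample 0 at +1/4, +1/4 and sample 1 at -1/4, -1/4 from the pixel center). The per-frame alternation is modeled as a horizontal flip inside each 2x2 quad, which is an assumption standing in for the projection offset: away from the image border it selects the same complementary set of pixels. `SampleToFullRes` and `PixelCoord` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

struct PixelCoord { std::uint32_t x, y; };

// qx, qy: pixel in the half-width, half-height MSAA 2X render target.
// sampleIndex: 0 or 1 (D3D standard 2X pattern).
// frameIndex: only its parity matters.
inline PixelCoord SampleToFullRes(std::uint32_t qx, std::uint32_t qy,
                                  std::uint32_t sampleIndex,
                                  std::uint32_t frameIndex) {
    // Sample 0 sits at (+0.25, +0.25) of the low-res pixel, which is the
    // center of the bottom-right full-res pixel of the 2x2 quad; sample 1
    // sits at (-0.25, -0.25), the center of the top-left one.
    std::uint32_t dx = (sampleIndex == 0) ? 1u : 0u;
    const std::uint32_t dy = dx;
    // Assumed alternation: odd frames use the other diagonal of the quad.
    if (frameIndex & 1u)
        dx ^= 1u;
    return PixelCoord{2u * qx + dx, 2u * qy + dy};
}
```

Over two consecutive frames every full resolution pixel is shaded exactly once, and within a single frame the shaded pixels form a checkerboard.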


[Figure: Standard 2 Sample Pattern]



CHECKERBOARD RENDERING BONUS 


• Particle effects can be easily evaluated per pixel instead of per sample 

• You can fit a lot more stuff in ESRAM ! 

• No need to fixup gradients in the shaders! 


CHECKERBOARD RENDERING IMPLEMENTATION


[Figure: gradient, shown with LOD bias. Texture gradients are represented by red lines]


CHECKERBOARD RENDERING IMPLEMENTATION 

• By offsetting the projection matrix again each frame we are able to alternate the pattern 

• We don’t always have access on PC to change sample locations


[Figure: sample pattern on even frames and odd frames]



FILLING IN THE BLANKS 


• To reconstruct colors for unknown pixels P and Q, we sample 

• Current frame direct neighbors linear-Z 

• Current frame direct neighbors color 

• History color and Z


[Figure: neighborhood of unknown pixels P and Q on even frames and odd frames]



HISTORY COLOR/Z 


• One neighbor gets picked for motion velocity: 

• Closest one to the camera to preserve silhouette 

• With motion velocity we sample the previous resolved color 

• That way we get to use filtering, but it introduces accumulation errors!

• We clamp the re-projected color with A B E F for Q 

• Using previous depth computed from motion we compute a confidence value 

• Used to blend back toward the unclamped value. 
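The clamp-then-blend step described above can be sketched as follows in C++. It clamps the reprojected history color to the component-wise range of the four neighbors A, B, E and F, then uses the confidence value to blend back toward the unclamped history. This is a simplified RGB illustration with hypothetical names (`Color`, `ClampHistory`); how the confidence is derived from the previous depth is not shown, and the flow diagram later in the talk indicates the shipped clamp works in YCoCg.

```cpp
#include <algorithm>
#include <cassert>

struct Color { float r, g, b; };

inline Color Min(const Color& a, const Color& b) {
    return {std::min(a.r, b.r), std::min(a.g, b.g), std::min(a.b, b.b)};
}

inline Color Max(const Color& a, const Color& b) {
    return {std::max(a.r, b.r), std::max(a.g, b.g), std::max(a.b, b.b)};
}

inline Color Lerp(const Color& a, const Color& b, float t) {
    return {a.r + (b.r - a.r) * t, a.g + (b.g - a.g) * t, a.b + (b.b - a.b) * t};
}

// confidence = 0 keeps the clamped color, confidence = 1 trusts the
// unclamped history (for example when the depth test says it is valid).
inline Color ClampHistory(const Color& history,
                          const Color& a, const Color& b,
                          const Color& e, const Color& f,
                          float confidence) {
    const Color lo = Min(Min(a, b), Min(e, f));
    const Color hi = Max(Max(a, b), Max(e, f));
    const Color clamped = Min(Max(history, lo), hi);
    return Lerp(clamped, history, confidence);
}
```

Clamping removes ghosting from stale history, while blending back when the history is trusted limits the flickering that an aggressive clamp would otherwise introduce.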
[Figure: neighborhood of pixel Q on odd frames]




RESOLVED COLOR 


• Having: 

• The history color 

• The interpolated color from direct neighbors 

• A final color is computed using two additional weights: 

• Color coherency: 

• Minimum difference between A B E F for Q 

• Magnitude of velocity 



[Figure: neighborhood of pixel Q on even frames and odd frames]




COMPLETE FLOW 

• Resolve quite complex 

• Lots of tweaks for our content! 

• Costs 1.4ms 


• 8-10ms net win


[Figure: resolve flow. Inputs: previous and current scene color, previous and current motion vector, previous and current linear depth. Stages: calc current min Z, min-Z offset, current color YCoCg, interpolate current neighbours, current interpolated color, depth occlusion test, motion coherency test, YCoCg AABB clamping, Z-confidence, motion coherency, clamped previous color, compute final confidence. Output: deinterlaced render target. Annotations on the diagram: "Ghosting-, Flickering++" and "Ghosting++".]




T-AA 

• Integrates with the checkerboard rendering 

• Can be run on the same resolve shader 

• Done on the sub-sample level, MSAA 4X style jitters on top of the 
checkerboard pattern 

• Reprojected color weight uses similar logic

• Additional “Unteething” used to remove bad checkerboard patterns


TEETH REMOVAL FILTER 


• Resolve can introduce noticeable sawtooth patterns in the image 

• We apply a filter to remove them

• The filter works on 5 horizontally or vertically adjacent pixels 


• We set up a threshold d and binarize each pixel to 0 or 1 if it falls in the range [0, d] or [1 - d, 1]

• We detect a 01010 or 10101 pattern
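The detection step can be sketched in a few lines of C++. The code below binarizes five adjacent values with a threshold d and reports whether they form an alternating 01010 or 10101 pattern. It assumes the five values have already been normalized to [0, 1]; the slides do not say how that normalization is done, and the names `Binarize` and `IsTeethPattern` are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Returns 0 when v lies in [0, d], 1 when v lies in [1 - d, 1],
// and -1 when v falls in neither range.
inline int Binarize(float v, float d) {
    if (v >= 0.0f && v <= d)
        return 0;
    if (v >= 1.0f - d && v <= 1.0f)
        return 1;
    return -1;
}

// True when the five binarized pixels alternate: 01010 or 10101.
inline bool IsTeethPattern(const std::array<float, 5>& pixels, float d) {
    const int first = Binarize(pixels[0], d);
    if (first < 0)
        return false;
    for (std::size_t i = 1; i < pixels.size(); ++i) {
        const int expected = (i % 2 == 0) ? first : 1 - first;
        if (Binarize(pixels[i], d) != expected)
            return false;
    }
    return true;
}
```

The same function works for horizontal and vertical runs; only the way the five pixels are gathered differs. What the filter does once a pattern is found is not described on the slide and is not modeled here.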


FUTURE DEVELOPMENTS


• Checkerboard technique was a good win for us 

• We are going to push more quality per pixel and build up on it 

• Implementation was mostly trial and error; we will move to a more
scientific approach on the different confidence weights and values used


BONUS SLIDES



GBUFFER LAYOUT 

4 Render Targets (RGB10A2 + 3 * RGBA8) + Depth & Stencil (D32 - S8)



CONFIG        ALIASED VALUE

Default       Self AO (sA8)

Skin          Skin SSS Mask (sA8)

Translucent   Translucence (sA7) + Back Face (A1)








GBUFFER RENDERING 

• We use inverted depth combined with a D32 float for better uniform depth precision distribution 

• For normals we experimented with BFN first 

• We moved to a R10G10B10A2 format to save VGPRs and ALU

• Velocity vector is 3D and enjoys a higher precision on the X & Y axis to support our temporal 
reprojection rendering 

• GBuffer Layer 2’s alpha is aliased depending on the material type 

• Self-AO was not used since SSBC proved sufficient most of the time

• We apply a higher SSAO factor on the first person character 





LIGHTING - GI

• GI is static and is based on a simplification of Assassin’s Creed Unity GI

• Low resolution volume covering whole map: 

• Sky visibility SH 

• 1m to 2m per voxel

• High resolution volume covering the playable area: 

• Sky visibility SH 

• Bounce color SH 

• 25cm per voxel 


[Figure: screenshot of low res volume]

[Figure: screenshot of high res volume]


LIGHTING - DIRECT

• We generate a clustered structure on the frustum: 

• 32x32 pixels based tile 

• Z exponential distribution 

• Hierarchical culling of light volume to fill the structure 



• Light cookies (gobos) are gathered in an array to be able to fetch them dynamically 

• Simply part of the light data as indices in an array 
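A minimal C++ sketch of how a pixel could be mapped into such a clustered structure follows: 32x32 pixel tiles in screen space and exponentially distributed slices in depth. Only the tile size and the exponential distribution come from the slides; the slice count, the near and far distances and the names (`ClusterGrid`, `DepthToSlice`, `ClusterIndex`) are assumptions for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

struct ClusterGrid {
    std::uint32_t tileSizePx;  // 32 in the talk
    std::uint32_t tilesX;      // ceil(width / tileSizePx)
    std::uint32_t tilesY;      // ceil(height / tileSizePx)
    std::uint32_t slicesZ;     // assumed slice count
    float zNear;               // assumed near distance of the first slice
    float zFar;                // assumed far distance of the last slice
};

// Exponential slicing: slice k starts at zNear * (zFar / zNear)^(k / slicesZ),
// so slices are thin near the camera and thick far away.
inline std::uint32_t DepthToSlice(const ClusterGrid& g, float viewZ) {
    if (viewZ <= g.zNear)
        return 0;
    if (viewZ >= g.zFar)
        return g.slicesZ - 1;
    const float t = std::log(viewZ / g.zNear) / std::log(g.zFar / g.zNear);
    const std::uint32_t slice =
        static_cast<std::uint32_t>(t * static_cast<float>(g.slicesZ));
    return slice < g.slicesZ ? slice : g.slicesZ - 1;
}

// Flattened cluster index for a pixel at (px, py) with view-space depth viewZ.
inline std::uint32_t ClusterIndex(const ClusterGrid& g, std::uint32_t px,
                                  std::uint32_t py, float viewZ) {
    const std::uint32_t tx = px / g.tileSizePx;
    const std::uint32_t ty = py / g.tileSizePx;
    const std::uint32_t tz = DepthToSlice(g, viewZ);
    return (tz * g.tilesY + ty) * g.tilesX + tx;
}
```

Each cluster would then hold the list of lights, local cubemaps and gobo indices whose volumes overlap it; building those lists with hierarchical culling is outside the scope of this sketch.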


LIGHTING - SSR

• Done in 1/4 resolution


• Uses face normal to give ray direction 

• Temporal reprojection with light accumulation (ray-based, not depth based) 

• Linear marching, steps gloss dependent 

• Jitter start ray position and direction 

• Temporal reprojection smooths the results

• Invalidate previous frame result on camera movement 


LIGHTING - REFLECTION


• Local cubemaps 

• Parallax corrected 

• Regarded as lights, volume injected in clustered lighting structure 

• Reside in cubemap array for easy access 

• Cubemaps applied during SSR application 

• Local cubemaps are SSR’s primary fallback 

• Global cubemap is secondary fallback 


Screenshot showing cubemap volumes 



LIGHTING - FORWARD 


• Support same set of features as the deferred pass: 

• All shadows, cubemaps, cookies are in texture arrays 

• VGPR consumption issues: 

• Scaling down on the quality of shadow filtering 

• Glass disables some light types

• Still lowest occupancy in our renderer 

• Expensive particles use the ESM version of the shadow cache 


SCHEDULING 


• The graphics thread manages work queues and steals work when necessary; stolen work gets
executed on the immediate context when possible to minimize overhead.

• On PC no draw calls are recorded; we let the material based draw call pipeline handle the scaling.

• On consoles graphics work has priority on Cluster 0 (cores 0, 1, 2), and we also maintain cluster
locality when scheduling tasks.

• Fork & join work can take a turn for the worse when hammering shared atomics (add numbers)


SCHEDULING 


• Rendering-specific scheduler on top of the engine scheduler: 

• Full control of graphic task behavior to fit in our budgets 

• Task dependencies code defined 

• Investing in visualisation tools would have been worthwhile

• First implementation used system fibers 

• Workers can steal a job with more priority instead of waiting 

• Fibers confusing to programmers 

• Some systems have trouble displaying them properly in the debugger 

• We moved to a simpler model where yielding just executes a new job on the current context 


GRAPHIC CPU PERFORMANCES ON RAINBOW 


• Aside from initialization, zero tolerance for global allocator usage during the frame

• Heavy use of per worker thread local allocator

• Resets when outermost job finishes

• Helps cache locality and is more flexible

• Heavy use of pooling

• Dangling pointers become harder to detect

• Adding memory state values on builds to check validity 
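A per-worker allocator that resets when the outermost job finishes can be sketched as a bump allocator with a nesting counter. The C++ below is a minimal illustration under stated assumptions: one instance per worker thread (so no locking), a fixed capacity reserved at initialization, and hypothetical names (`WorkerLinearAllocator`, `BeginJob`, `EndJob`). It is not the engine's allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

class WorkerLinearAllocator {
public:
    // Capacity is reserved once, at initialization, not during the frame.
    explicit WorkerLinearAllocator(std::size_t capacityBytes)
        : storage_(capacityBytes) {}

    // Jobs can nest when a worker runs another job while yielding.
    void BeginJob() { ++depth_; }

    // Memory is released in one go when the outermost job finishes.
    void EndJob() {
        assert(depth_ > 0);
        if (--depth_ == 0)
            offset_ = 0;
    }

    // Bump allocation; alignment must be a power of two and is applied
    // relative to the start of the storage block. Returns nullptr when full.
    void* Allocate(std::size_t size, std::size_t alignment = 16) {
        const std::size_t aligned = (offset_ + alignment - 1) & ~(alignment - 1);
        if (aligned + size > storage_.size())
            return nullptr;
        offset_ = aligned + size;
        return storage_.data() + aligned;
    }

    std::size_t Used() const { return offset_; }

private:
    std::vector<std::uint8_t> storage_;
    std::size_t offset_ = 0;
    std::uint32_t depth_ = 0;
};
```

Allocations made by one job sit next to each other in memory, which is where the cache locality benefit comes from, and nothing touches the global allocator between `BeginJob` and `EndJob`.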


• Memory access patterns were 95% of the optimization work on the graphics side

• Per thread gather lists are used to decrease inter thread communication 

• Atomics have an important cost if not used properly 


SLI SUPPORT 


• Driver tracking disabled on all resources 

• Simple scoping interface for update of resources that need sync 

• One line addition to the code when necessary 

• Update of a couple of large buffers was implemented by propagating the 
changes manually on each GPU 

• A lot more efficient than synching buffers 

• Update of unified constant buffers takes a couple of µs, copy breaks scaling


